:1 Price
:2 Description provided by the critic.
library(tidyverse)#load tidyverse
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DataComputing)#load Datacomputing
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
library(leaflet)# load Leaflet map
library(tidytext)#load tidytext for word extraction
## Warning: package 'tidytext' was built under R version 3.6.3
options(warn=-1) # to get rid of some warnings that can be remove
-I would like to do an informative research on wine price which include appreciation and depreciation. What factor drives wine price of the wine up and down? What makes wine A more expensive that wine B?
-The data is from Kaggle.
-The data was collected by winemag a wine tasting magazine.
-Each case represents each wine.
-There are over 130 thousand cases.
-Wine region wine score wine price and variant
Wine<- read.csv(file = 'winemag-data-130k-v2.csv',header = TRUE, stringsAsFactors = FALSE) # Read the wine file to Wine
Wine
Prepare the data for further step
Find the average price of wine
Remove all the NA
wine2<-na.omit(Wine)#na.omit remove all the NA in price
avg<-
wine2%>%
group_by(country)%>% #we want to see the per country part
summarise (avg = mean(price),Score =mean(points)) #average of both price and points
avg
ggplot(data=avg,aes(x=avg,y=Score, ymin=80))+geom_point() # try to look for a correlation between price and score with the average of the country
finding 95 percentile of the wine price
Seems to have a corelation between price and points on the average of each country
wine2 %>%
group_by(country) %>%
summarise(quantile = scales::percent(c(0.95)),# scales to find the price for each percential and also score. I end up satisfy with the top 5 because of the score being average of 94 in the US and France(see the scale on Readme.md)
price = quantile(price, c(0.95)),score = quantile(points, c(0.95)))
Finding bottom 5
wine2 %>%
group_by(country) %>%
summarise(quantile = scales::percent(c(0.05)),# scales to find the price for each percential and also score. I end up satisfy with the bottom 5 because of the score being average of 84 in France and 83 in the US(see the scale on Readme.md)
price = quantile(price, c(0.05)),score= quantile(points, c(0.05)))
As we can see that the scale of the wine is not 0-100 points but start approximately 80
So when we create a graph for the wine we must set the scale to 80-100
p95<-
wine2 %>% filter(points > quantile(points, 0.95))
p05<-
wine2 %>% filter(points < quantile(points, 0.05))
#use warn because to amount of warning on top of the GGplot preventing me to vitualizing the data
warn<-ggplot(data=p95,aes(x=price,y=points,ymin=96, ymax=100))+geom_point()+facet_wrap(~country,ncol=4) + stat_smooth(method="auto")
suppressWarnings(print(warn))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
warn<-ggplot(data=p05,aes(x=price,y=points,ymin=80, ymax=100))+geom_point()+facet_wrap(~country,ncol=4)+ stat_smooth(method="auto")
suppressWarnings(print(warn))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is no linear corelation with the price and point but the average of top 5 and bottom 5 vary by between 20 to 70 usd to the higher point wine
Focus on the American wine Because
1. American wine can be found easily here in the United States
2. Can easily compare because in an area for example NAPA valley there is a production of both cheap and expensive wine
3. Good split between the top 5 and bottom 5 2793 and 2784
filtered95 <- p95 %>% # filter the US
filter(country == "US")
tokenized_comments95 <- filtered95 %>%
select(description, designation, points, price, province, region_1, variety, winery) %>%
unnest_tokens(word, description) %>% #take each word from the description to the word bank
anti_join(stop_words) %>%
filter(word != "wine") %>%
filter(province != "America")%>%
group_by(province, word) %>%
tally()
## Joining, by = "word"
tokenized_comments95 %>% glimpse() #show each word
## Observations: 10,894
## Variables: 3
## Groups: province [4]
## $ province <chr> "California", "California", "California", "California", "C...
## $ word <chr> "02", "03", "04", "05", "06", "064", "07", "08", "09", "1"...
## $ n <int> 1, 1, 2, 13, 9, 1, 7, 3, 9, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, ...
tokenized_comments95 %>%
group_by(province) %>%
top_n(15) %>%
arrange(desc(n)) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = factor(province))) +
geom_bar(stat = "identity") +#bar chart for top words of each region
theme(legend.position = "none") +
facet_wrap(~ province, scales = "free") +
coord_flip() +
labs(x = "Frequency",
y = "Top words",
title = "Top 5 Percent US wine",
subtitle = "")
## Selecting by n
filtered05 <- p05 %>%
filter(country == "US")
tokenized_comments05 <- filtered05 %>%
select(description, designation, points, price, province, region_1, variety, winery) %>%
unnest_tokens(word, description) %>% #take each word from the description to the word bank
anti_join(stop_words) %>%
filter(word != "wine") %>%
filter(province != "America")%>%
group_by(province, word) %>%
tally()
## Joining, by = "word"
tokenized_comments05 %>% glimpse()#show each word
## Observations: 8,123
## Variables: 3
## Groups: province [21]
## $ province <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ar...
## $ word <chr> "appealing", "apples", "aromas", "berry", "blanc", "blend"...
## $ n <int> 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
tokenized_comments05 %>%
group_by(province) %>%
top_n(15) %>%
arrange(desc(n)) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = factor(province))) +
geom_bar(stat = "identity") +#bar chart for top words of each region
theme(legend.position = "none") +
facet_wrap(~ province, scales = "free") +
coord_flip() +
labs(x = "Frequency",
y = "Top words",
title = "Bottom 5 Percent US wine",
subtitle = "")
## Selecting by n
As we can see the top 5 percent of wine from the United States are from the 4 states(California, New York, Oregon, and Washington)
latlong<- read.csv(file = 'statelatlong.csv',header = TRUE, stringsAsFactors = FALSE) #read the latitude longitude data to plot on leaflet
latlong
Start with Bottom 5 Wine
wineUS05<-
tokenized_comments05 %>%
inner_join(latlong,by = c("province" = "City"))#join the data so the city can have the coordinate
wineUS051<- wineUS05%>%
group_by(province)%>%
summarise(max(n),)
wineUS05 <-wineUS051%>%
left_join(wineUS05, by = c('province'='province', 'max(n)'='n'))#taking the wine data to the word count data
wineUS05<-
select(wineUS05,-c(State))
# Show the data map of the top word through the region
leaflet(data = wineUS05) %>% addTiles() %>%
addMarkers(~Longitude, ~Latitude, popup = ~word, label = ~word)
Top 5 wine
wineUS95<-
tokenized_comments95 %>%
inner_join(latlong,by = c("province" = "City"))#join the data so the city can have the coordinate
wineUS951<- wineUS95%>%
group_by(province)%>%
summarise(max(n),)
wineUS95 <-wineUS951%>%
left_join(wineUS95, by = c('province'='province', 'max(n)'='n'))#taking the wine data to the word count data
wineUS95<-
select(wineUS95,-c(State))
# Show the data map of the top word through the region
leaflet(data = wineUS95) %>% addTiles() %>%
addMarkers(~Longitude, ~Latitude, popup = ~word, label = ~word)
Seems like for even the bottom 5 the word is very similar to the top 5 but you can see that some word such as “faint, kicking, sour” is still appear in the top 5 but less so than the bottom 5
wineUS05 %>%
pivot_wider(names_from = word, values_from = 'max(n)')# using pivot wider to help visualize the important keyword
There are 42 different words from the Bottom 5 and most of it are NA due to the fact that most of the states only make one wine and it wasn't rated highly
The word that appear the most is flavors and palate
wineUS95 %>%
pivot_wider(names_from = word, values_from = 'max(n)')# using pivot wider to help visualize the important keyword
The top 5 wine is also have a lot of flavors and palate but also the 'fruit' which is absent in the bottom 5 I suspect that bottom 5 wine do have decent taste but lack the fruit flavor
filtered05 %>%
filter(grepl('faint|kicking|sour', description, ignore.case = TRUE))#regular expression to find the wine with the word describe
210 out of 2793 bottom wine contains which is 7.5 percent
filtered95 %>%
filter(grepl(' faint| kicking| sour', description, ignore.case = TRUE))#regular expression to find the wine with the word describe
140 out of 2784 top wine contains which is 5 percent
filtered05 %>%
filter(grepl('fruit', description, ignore.case = TRUE))#regular expression to find the wine with the word describe
728 out of 2793 wine contain which is 26 percent
filtered95 %>%
filter(grepl('fruit', description, ignore.case = TRUE))#regular expression to find the wine with the word describe
1140 out of 2784 wine contain which is 41 percent
We then try to compare the 2 word to word
compare_token <-tokenized_comments05%>%
inner_join(tokenized_comments95, by = c('province'='province', 'word'='word'))%>%
mutate(difference = (n.y/n.x))%>%#mutate the ratio of the top and bottom wine
arrange(desc(difference))
compare_token
The word such as develop, refined, perfect, delicius, elegance can be found much more promenently in the top wine
compare_token <-tokenized_comments95%>%
inner_join(tokenized_comments05, by = c('province'='province', 'word'='word'))%>%
mutate(difference = (n.y/n.x))%>%#mutate the ratio of the top and bottom wine
arrange(desc(difference))
compare_token
The word such as Thin, simple, un ripe, dull, harsh can be found much more promenently in the bottom wine
filtered05%>%
summarise (avg = mean(price),Score =mean(points)) #mean price and points
filtered95%>%
summarise (avg = mean(price),Score =mean(points))# mean price and points
Price difference of 55 usd and the score difference of 12
With the data from step 3 it seems to suggest that
1.Wine with top wine with high score tends to have a description that is more positive such as develop, refined, perfect, delicius, elegance.
Wine with lower quality such as the bottom 5 tend to have a very neutral to negative description Thin, simple, un ripe, dull, harsh
The word can also be on either the bottom or top so it's not a sure thing
2. Wine that are more expensive tend to score higher than the cheaper counter part (this could imply that the price affect the score because people might bias toward more expensive comodity) The price is not the sure indication of the quality as you can see that some cheap wine can score very highly and vise versa.